Quarterly and Annual Ridership Totals by Mode of Transportation 1
The first dataset comes from the American Public Transportation Association (APTA) and serves as an introductory synopsis of public transit ridership over time. It gives a broad view of quarterly ridership across the entire country from 1990 onward, which sets the stage for the problem we intend to explore.
The raw data and methodology for how it was obtained can be found using this link: https://www.apta.com/research-technical-resources/transit-statistics/ridership-report/
The data itself can be downloaded using this link: https://www.apta.com/wp-content/uploads/APTA-Ridership-by-Mode-and-Quarter-1990-Present.xlsx
To download this data, I used an R API tool, which saves the data in Excel format. Below is the code for this action and a screenshot of the raw data to illustrate its form upon download:
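For readers working in Python, the same download step might look like the following sketch. It is not the R code referred to above. The helper name `download_file`, the `opener` parameter, and the destination path in the usage comment are assumptions made for illustration, and the server may reject requests that lack a browser-like User-Agent header.

```python
from pathlib import Path
from urllib.request import urlopen

# URL of the APTA workbook, as given in the text above.
APTA_URL = (
    "https://www.apta.com/wp-content/uploads/"
    "APTA-Ridership-by-Mode-and-Quarter-1990-Present.xlsx"
)


def download_file(url, dest, opener=urlopen):
    """Save the bytes served at `url` to `dest` and return the path.

    `opener` is a hypothetical hook that defaults to urllib's urlopen;
    it lets the function be exercised without network access.
    """
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    with opener(url) as response:
        dest.write_bytes(response.read())
    return dest


# Example call (the destination path is an assumption):
# download_file(APTA_URL, "../data/apta/ridership_by_mode.xlsx")
```

The saved workbook can then be opened in Excel or read into a dataframe for cleaning.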
An essential part of understanding public perception of a topic is assessing how it is covered in the news. News coverage often informs general opinions and can introduce conversations that had not previously been in the zeitgeist. Thus, this paper will analyze text data from https://newsapi.org/ to study news coverage of two distinct public transit systems.
For this project, I will be looking at data regarding the Washington Metropolitan Area Transit Authority (WMATA) and the Bay Area Rapid Transit (BART). Both of these transit systems have several advantages for academic study: they are both large networks with rich histories and connections to their respective cities, there exist robust data sources allowing us to analyze information from several angles, and comparing them will allow us to get perspectives on differences between cities on opposite coasts.
The following shows how I accessed this News API via Python code. This outputs a JSON file as raw data, the start of which is included below each code block to show the nature of the data prior to cleaning.
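As a minimal sketch of such a request, and not necessarily the exact code used for this project, the block below builds a query against the News API `everything` endpoint and writes the JSON response to disk. The function names, the search term, the output path, and the placeholder key are assumptions; a real API key from newsapi.org is required, and the service may expect a User-Agent header on some clients.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

NEWSAPI_ENDPOINT = "https://newsapi.org/v2/everything"


def build_newsapi_url(query, api_key, language="en", page_size=100):
    """Return the request URL for a keyword search (hypothetical helper)."""
    params = {
        "q": query,             # search term, e.g. "WMATA" or "BART"
        "language": language,   # restrict results to one language
        "pageSize": page_size,  # articles per page
        "apiKey": api_key,      # personal key issued by newsapi.org
    }
    return f"{NEWSAPI_ENDPOINT}?{urlencode(params)}"


def fetch_articles(query, api_key, out_path):
    """Download one page of results and save the raw JSON to `out_path`."""
    with urlopen(build_newsapi_url(query, api_key)) as response:
        payload = json.load(response)
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(payload, f, indent=2)
    return payload.get("articles", [])


# Example call (placeholder key and assumed output path):
# fetch_articles("WMATA", "YOUR_API_KEY", "wmata_news.json")
```

Running the same call once per transit system keeps the raw JSON for each system in its own file before cleaning.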
It is reasonable to hypothesize that one of the main factors in public transit usage is people commuting to and from work. “Rush hour” is an everyday phrase for the times in the morning and evening when most people travel to or from work. Thus, when COVID-19 struck and many workers were no longer expected to work in person, the need for public transportation decreased drastically.
In the years since, remote work has been a topic of controversy. Many workers enjoy the benefits of privacy and the added time of not having to commute, while employers often cite advantages of being on-site even in office jobs. While in-person work has rebounded recently, much like public transit usage, it has not come close to returning to its pre-pandemic prevalence. Therefore, understanding trends surrounding remote work can provide insights on how to analyze public transportation trends.
WFH Research has exhaustive data sets regarding remote work information. For the purposes of this project, we will take into account three data sets. To better understand the controversial aspects of remote work, the first two data sets contain survey information from (a) employers and (b) workers on what they desire in terms of average remote work days per week. The third data set provides time series information on the amount of working from home (percent of full paid days) for large cities, including Washington, D.C. and the San Francisco Bay Area. Screenshots of the raw data are shown below:
Now that we have background information regarding both cities of concern, the next piece of information to gather is public transit ridership. This will give us comprehensive monthly data from 2018 to 2023 to provide insights on the nature of the decline in public transit, as well as the current recovery.
For WMATA, the data comes from https://www.wmata.com/initiatives/ridership-portal/, and gives simple average daily entries per month. Meanwhile, BART data comes from https://www.bart.gov/about/reports/ridership, which provides reports each month on entries and exits by station. The raw data for WMATA entries, as well as the most recent BART report, are shown below:
WMATA Average Daily Entries by Month
September 2023 Report for BART Entries/Exits by Station
Ridership by Hour
In addition to the volume of public transit usage, we can glean information on the purpose of public transit usage by analyzing ridership by hour of the day. High peaks during “rush hour” likely indicate a great influence of work commuting on the data. Because of this, I downloaded both pre-pandemic and post-pandemic data sets regarding WMATA ridership by hour to view this relationship and whether it has changed due to new circumstances. In this case, March 17, 2020 is chosen as the demarcation date, as that was the day on which the first social distancing precautions were announced in Washington, D.C. The data is shown below:
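One way to make the pre/post comparison concrete is to split records at the demarcation date and compute each hour's share of total entries within each period. The sketch below assumes the data can be reduced to `(date, hour, entries)` tuples, which is an assumption about the cleaned form rather than the layout of the raw WMATA files. The function name is hypothetical, and the cutoff date itself is counted as post-pandemic here.

```python
from collections import defaultdict
from datetime import date

# Demarcation date stated in the text.
CUTOFF = date(2020, 3, 17)


def hourly_share_by_period(rows, cutoff=CUTOFF):
    """Return {"pre": {hour: share}, "post": {hour: share}}.

    `rows` is an iterable of (day, hour, entries) tuples. Dates before
    `cutoff` count as "pre"; the cutoff date and later count as "post".
    """
    totals = {"pre": defaultdict(float), "post": defaultdict(float)}
    for day, hour, entries in rows:
        period = "pre" if day < cutoff else "post"
        totals[period][hour] += entries

    shares = {}
    for period, by_hour in totals.items():
        grand_total = sum(by_hour.values())
        # Leave the period empty if it has no recorded entries.
        shares[period] = (
            {h: v / grand_total for h, v in by_hour.items()} if grand_total else {}
        )
    return shares
```

Comparing shares rather than raw counts keeps the two periods comparable even though overall ridership fell sharply. A lower morning and evening share in the "post" period would suggest a weaker commuting peak.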
In answering the question of whether public service should be the paramount consideration when judging public transit’s efficacy, it is important to understand that transit often disproportionately serves underprivileged groups. By analyzing demographic data, we can gather insights on who benefits most from robust public transit systems. To address this, there is data from the U.S. Census Bureau that provides 5-year estimates from 2021 of means of transportation to work by selected characteristics. The raw data is shown below:
2021 5-Year Estimate of Transportation Means by Demographic
As with the previous dataset, we want to further address the idea that public transit affects different groups of people to varying extents. Therefore, a useful source of data is a survey dataset from ipums.org, which has millions of survey responses from the U.S. Census Bureau. The main reason for obtaining this data is the presence of a “Means of transportation to work” field, which will serve as our labels. Additionally, various demographic fields can be used to perform Naive Bayes classification later on. This data was obtained by submitting a download request for 2021 Census data with the following fields:
Gauging public sentiment regarding public transit systems can be a great way to analyze the relationship between a system and the residents of its city. Regardless of external factors, consumer dissatisfaction with a mode of transportation could greatly influence its usage when other methods are readily available for many. Thus, these datasets will feature Yelp reviews of both the WMATA and BART systems, including the date of review, the exact text, and the associated numerical rating (1-5 stars). Gathering labeled text data will be invaluable for Naive Bayes classification in the future. To accomplish this, I will use the BeautifulSoup package in Python, which facilitates web scraping by parsing HTML. Since the reviews span several pages, it is necessary to iterate over each page to obtain every review available to us. The code for this is below, along with screenshots of the raw data after being collated into dataframes.
WMATA Yelp Reviews
Code
import pandas as pd
import requests
from bs4 import BeautifulSoup

base_url = 'https://www.yelp.com/biz/bart-bay-area-rapid-transit-oakland-2'
num_rating1 = []
review_date1 = []
review_text1 = []

# Page 0 is the landing page; later pages use ?start=10, 20, ..., 1000.
for i in range(0, 101):
    url = base_url if i == 0 else base_url + '?start=' + str(i) + '0'
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    hold1 = soup.find(string='Recommended Reviews')
    if hold1 is None:
        # No review section on this page, so skip it.
        continue
    reviews = hold1.find_parent('section')
    for review in reviews.select('div[aria-label$="star rating"]'):
        num_rating1.append(review['aria-label'])
        review_date1.append(review.find_next('span').text)
        hold = review.find_next('span', lang=True)
        # Reviews without a text span are recorded as "NA".
        review_text1.append("NA" if hold is None else hold.text)

bart_reviews = pd.DataFrame(list(zip(num_rating1, review_date1, review_text1)))
bart_reviews.to_csv('../data/yelp_reviews/bart_reviews.csv')
BART Yelp Reviews
APTA
Public Transit Data by City
Code
import pandas as pd
import requests
from bs4 import BeautifulSoup

base_url = 'https://www.yelp.com/biz/metropolitan-transportation-authority-new-york-6'
num_rating1 = []
review_date1 = []
review_text1 = []

# Page 0 is the landing page; later pages use ?start=10, 20, ..., 130.
for i in range(0, 14):
    url = base_url if i == 0 else base_url + '?start=' + str(i) + '0'
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    hold1 = soup.find(string='Recommended Reviews')
    if hold1 is None:
        # No review section on this page, so skip it.
        continue
    reviews = hold1.find_parent('section')
    for review in reviews.select('div[aria-label$="star rating"]'):
        num_rating1.append(review['aria-label'])
        review_date1.append(review.find_next('span').text)
        hold = review.find_next('span', lang=True)
        # Reviews without a text span are recorded as "NA".
        review_text1.append("NA" if hold is None else hold.text)

mta_reviews = pd.DataFrame(list(zip(num_rating1, review_date1, review_text1)))
mta_reviews.to_csv('../data/yelp_reviews/mta_reviews.csv')
Code
import pandas as pd
import requests
from bs4 import BeautifulSoup

base_url = 'https://www.yelp.com/biz/metro-los-angeles-los-angeles'
num_rating1 = []
review_date1 = []
review_text1 = []

# Page 0 is the landing page; later pages use ?start=10, 20, ..., 170.
for i in range(0, 18):
    url = base_url if i == 0 else base_url + '?start=' + str(i) + '0'
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    hold1 = soup.find(string='Recommended Reviews')
    if hold1 is None:
        # No review section on this page, so skip it.
        continue
    reviews = hold1.find_parent('section')
    for review in reviews.select('div[aria-label$="star rating"]'):
        num_rating1.append(review['aria-label'])
        review_date1.append(review.find_next('span').text)
        hold = review.find_next('span', lang=True)
        # Reviews without a text span are recorded as "NA".
        review_text1.append("NA" if hold is None else hold.text)

la_reviews = pd.DataFrame(list(zip(num_rating1, review_date1, review_text1)))
la_reviews.to_csv('../data/yelp_reviews/la_reviews.csv')
Code
import pandas as pd
import requests
from bs4 import BeautifulSoup

base_url = 'https://www.yelp.com/biz/chicago-transit-authority-chicago-6'
num_rating1 = []
review_date1 = []
review_text1 = []

# Page 0 is the landing page; later pages use ?start=10, 20, ..., 370.
for i in range(0, 38):
    url = base_url if i == 0 else base_url + '?start=' + str(i) + '0'
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    hold1 = soup.find(string='Recommended Reviews')
    if hold1 is None:
        # No review section on this page, so skip it.
        continue
    reviews = hold1.find_parent('section')
    for review in reviews.select('div[aria-label$="star rating"]'):
        num_rating1.append(review['aria-label'])
        review_date1.append(review.find_next('span').text)
        hold = review.find_next('span', lang=True)
        # Reviews without a text span are recorded as "NA".
        review_text1.append("NA" if hold is None else hold.text)

cta_reviews = pd.DataFrame(list(zip(num_rating1, review_date1, review_text1)))
cta_reviews.to_csv('../data/yelp_reviews/cta_reviews.csv')
Code
import pandas as pd
import requests
from bs4 import BeautifulSoup

base_url = 'https://www.yelp.com/biz/septa-philadelphia-7'
num_rating1 = []
review_date1 = []
review_text1 = []

# Page 0 is the landing page; later pages use ?start=10, 20, ..., 90.
for i in range(0, 10):
    url = base_url if i == 0 else base_url + '?start=' + str(i) + '0'
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    hold1 = soup.find(string='Recommended Reviews')
    if hold1 is None:
        # No review section on this page, so skip it.
        continue
    reviews = hold1.find_parent('section')
    for review in reviews.select('div[aria-label$="star rating"]'):
        num_rating1.append(review['aria-label'])
        review_date1.append(review.find_next('span').text)
        hold = review.find_next('span', lang=True)
        # Reviews without a text span are recorded as "NA".
        review_text1.append("NA" if hold is None else hold.text)

septa_reviews = pd.DataFrame(list(zip(num_rating1, review_date1, review_text1)))
septa_reviews.to_csv('../data/yelp_reviews/septa_reviews.csv')
Code
import pandas as pd
import requests
from bs4 import BeautifulSoup

base_url = 'https://www.yelp.com/biz/massachusetts-bay-transportation-authority-boston'
num_rating1 = []
review_date1 = []
review_text1 = []

# Page 0 is the landing page; later pages use ?start=10, 20, ..., 330.
for i in range(0, 34):
    url = base_url if i == 0 else base_url + '?start=' + str(i) + '0'
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    hold1 = soup.find(string='Recommended Reviews')
    if hold1 is None:
        # No review section on this page, so skip it.
        continue
    reviews = hold1.find_parent('section')
    for review in reviews.select('div[aria-label$="star rating"]'):
        num_rating1.append(review['aria-label'])
        review_date1.append(review.find_next('span').text)
        hold = review.find_next('span', lang=True)
        # Reviews without a text span are recorded as "NA".
        review_text1.append("NA" if hold is None else hold.text)

mbta_reviews = pd.DataFrame(list(zip(num_rating1, review_date1, review_text1)))
mbta_reviews.to_csv('../data/yelp_reviews/mbta_reviews.csv')
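The scraping code above stores each rating as the raw `aria-label` string matched by the selector, which ends in "star rating". A later cleaning step would need a number instead. The helper below is a minimal sketch of that conversion, assuming labels of the form "4 star rating"; the function name is hypothetical and the exact label wording on Yelp may differ or change.

```python
import re


def parse_star_rating(label):
    """Return the numeric rating from a label like '4 star rating'.

    Returns None when the label does not match the assumed pattern.
    """
    match = re.match(r"\s*(\d+(?:\.\d+)?)\s+star rating", label)
    return float(match.group(1)) if match else None


# Example: convert a list of scraped labels to numbers.
labels = ["5 star rating", "1 star rating", "NA"]
ratings = [parse_star_rating(label) for label in labels]
```

Unmatched labels come back as `None`, so they can be dropped or inspected before the ratings are used as class labels.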
Footnotes
“Ridership Report.” American Public Transportation Association, 21 Sept. 2023, www.apta.com/research-technical-resources/transit-statistics/ridership-report/.
“News API – Search News and Blog Articles on the Web.” News API – Search News and Blog Articles on the Web, newsapi.org/. Accessed 12 Oct. 2023.
Barrero, Jose Maria, et al. Why Working from Home Will Stick, 2021, https://doi.org/10.3386/w28731.
“Ridership Reports.” Ridership Reports | Bay Area Rapid Transit, www.bart.gov/about/reports/ridership. Accessed 13 Oct. 2023.
U.S. Census Bureau. “MEANS OF TRANSPORTATION TO WORK BY SELECTED CHARACTERISTICS.” American Community Survey, ACS 5-Year Estimates Subject Tables, Table S0802, 2021, https://data.census.gov/table/ACSST5Y2021.S0802?t=Commuting&g=860XX00US20020,20032. Accessed on October 12, 2023.
Steven Ruggles, Sarah Flood, Matthew Sobek, Danika Brockman, Grace Cooper, Stephanie Richards, and Megan Schouweiler. IPUMS USA: Version 13.0 [dataset]. Minneapolis, MN: IPUMS, 2023. https://doi.org/10.18128/D010.V13.0